Python for Bioinformatics

This Jupyter notebook is intented to be used alongside the book Python for Bioinformatics

Chapter 19: Filtering Out Specific Fields from a GenBank File

Note: Before opening the file, this file should be accesible from this Jupyter notebook. In order to do so, the following commands will download these files from Github and extract them into a directory called samples.


In [1]:
!curl https://raw.githubusercontent.com/Serulab/Py4Bio/master/samples/samples.tar.bz2 -o samples.tar.bz2
!mkdir samples
!tar xvfj samples.tar.bz2 -C samples


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.5M  100 16.5M    0     0  22.0M      0 --:--:-- --:--:-- --:--:-- 22.0M
mkdir: cannot create directory 'samples': File exists
BLAST_output.xml
TAIR7_Transcripts_by_map_position.gz
pMOSBlue.txt
fishbacteria.csv
UniVec_Core.nsq
t3beta.fasta
PythonU.db
input4align.dnd
pdb1apk.ent.gz
readme.txt
contig1.ace
example.aln
hsc1.fasta
bioinfo/seqs/15721870.fasta
primers.txt
bioinfo/seqs/4586830.fasta
bioinfo/seqs/7638455.fasta
GSM188012.CEL
3seqs.fas
sampleX.fas
sampleXblast.xml
B1.csv
phd1
conglycinin.phy
bioinfo/seqs/218744616.fasta
spfile.txt
bioinfo/seqs/513419.fasta
bioinfo/seqs/513710.fasta
prot.fas
cas9align.fasta
seqA.fas
bioinfo/seqs/
bioinfo/
pdbaa
other.xml
vectorssmall.fasta
t3.fasta
a19.gp
data.csv
input4align.fasta
B1IXL9.txt
fasta22.fas
bioinfo/seqs/7415878.fasta
bioinfo/seqs/513718.fasta
bioinfo/seqs/513719.fasta
bioinfo/seqs/6598312.fasta
UniVec_Core.nin
Q5R5X8.fas
bioinfo/seqs/513717.fasta
BcrA.gp
bioinfo/seqs/2623545.fasta
bioinfo/seqs/63108399.fasta
conglycinin.dnd
NC2033.txt
fishdata.csv
uniprotrecord.xml
BLAST_output.html
Q9JJE1.xml
test3.csv
UniVec_Core.nhr
sampledata.xlsx
UniVec_Core
NC_006581.gb
conglycinin.multiple.phy
conglycinin.fasta

Listing 19.1: genbank1.py: Extract sequences from a Genbank file


In [2]:
from Bio import SeqIO, SeqRecord, Seq
from Bio.Alphabet import IUPAC

GB_FILE = 'samples/NC_006581.gb'
OUT_FILE = 'nadh.fasta'
with open(GB_FILE) as gb_fh:
    record = SeqIO.read(gb_fh, 'genbank')
seqs_for_fasta = []
for feature in record.features:
    # Each Genbank record may have several features, the program
    # will walk over all of them.
    qualifier = feature.qualifiers
    # Each feature has several parameters
    # Pick selected parameters.
    if 'NADH' in qualifier.get('product',[''])[0] and \
    'product' in qualifier and 'translation' in qualifier:
        id_ = qualifier['db_xref'][0][3:]
        desc = qualifier['product'][0]
        # nadh_sq is a NADH protein sequence
        nadh_sq = Seq.Seq(qualifier['translation'][0], IUPAC.protein)
        # 'srec' is a SeqRecord object from nadh_sq sequence.
        srec = SeqRecord.SeqRecord(nadh_sq, id=id_, description=desc)
        # Add this SeqRecord object into seqsforfasta list.
        seqs_for_fasta.append(srec)
with open(OUT_FILE, 'w') as outf:
    # Write all the sequences as a FASTA file.
    SeqIO.write(seqs_for_fasta, outf, 'fasta')

Listing 19.2: genbank2.py: Extract upstream regions


In [3]:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

GB_FILE = 'samples/NC_006581.gb'
OUT_FILE = 'tg.fasta'
with open(GB_FILE) as gb_fh:
    record = SeqIO.read(gb_fh, 'genbank')
seqs_for_fasta = []
tg = (['cox2'],['atp6'],['atp9'],['cob'])
for feature in record.features:
    if feature.qualifiers.get('gene') in tg and feature.type=='gene':
        # Get the name of the gene
        genename = feature.qualifiers.get('gene')
        # Get the start position
        startpos = feature.location.start.position
        # Get the required slice
        newfrag = record.seq[startpos-1000: startpos]
        # Build a SeqRecord object
        newrec = SeqRecord(newfrag, genename[0] + ' 1000bp upstream',
                           '','')
        seqs_for_fasta.append(newrec)
with open(OUT_FILE,'w') as outf:
    # Write all the sequences as a FASTA file.
    SeqIO.write(seqs_for_fasta, outf, 'fasta')